Phonetic transcriptions in the Spoken Dutch Corpus: how to combine efficiency and good transcription quality
Abstract
This paper reports on an experiment aimed at establishing how phonetic transcriptions for the large CGN corpus can be obtained most efficiently. This experiment explores the potential of an automatically generated transcription (AGT) by comparing an AGT with a reference transcription (Tref) of the same material, to determine whether and how the AGT can be improved to make it more similar to Tref. The results indicate that the AGT can be optimized through pronunciation variation modelling so as to make human corrections more efficient or even superfluous, at least for some speech styles.
Related resources
How to Improve Human and Machine Transcriptions of Spontaneous Speech
This paper reports on an experiment aimed at measuring the quality of automatic and human phonetic transcriptions of different speech styles that were produced within the framework of a large speech corpus project for Dutch, the Spoken Dutch Corpus (Corpus Gesproken Nederlands, CGN). The results indicate that the procedure adopted in the CGN to improve the quality of phonetic transcriptions...
Automatic Phonetic Transcription of Large Speech Corpora
Most large speech corpora are delivered with a lexicon that contains a canonical transcription of every word in the orthographic transcription. Such a lexicon can be used for generating a hypothetical ‘canonical’ phonetic transcription from the orthography. In addition, time and money permitting, some speech corpora are provided with a manually verified broad phonetic transcription of at least ...
Regional Bias in the Broad Phonetic Transcriptions of the Spoken Dutch Corpus
In this paper, we assess an aspect of the quality of the broad phonetic transcriptions in the Spoken Dutch Corpus (CGN). The corpus contains speech from native speakers of Dutch originating from The Netherlands and the Dutch speaking part of Belgium. The phonetic transcriptions were made by transcribers from both regions. In previous research, we have identified regional differences in the tran...
Automatic generation of phonetic transcriptions for large speech corpora
We describe a method for the automatic production of phonetic transcriptions in large speech corpora. First, we focus on the application of different techniques for the generation of pronunciation variants. Then, we explain the application of a speech recognition system for selecting the acoustically best matching phonetic transcription. The system is evaluated on different test sets selected f...
The Influence of the Labeller's Regional Background on Phonetic Transcriptions: Implications for the Evaluation of Spoken Language Resources
Phonetic transcriptions of spoken language corpora are not an exact written reproduction of the speech signal. They are influenced by a variety of factors such as the transcriber's native categorical perception. What remains unexplored is to what extent variation of perception within the same language exerts any influence on phonetic transcriptions. We report a case study of the labelling of vo...